Basic Data Exploration

Using Pandas to Get Familiar With Your Data

在任何机器学习项目中，第一步是熟悉数据。您将使用Pandas库来完成这项工作。Pandas是数据科学家用来探索和操作数据的主要工具。大多数人在他们的代码中将pandas缩写为pd。我们使用以下命令来实现这一点：

python

import pandas as pd

Pandas库中最重要的部分是DataFrame。DataFrame 存储了你可能认为是表格形式的数据。这类似于Excel中的工作表，或者SQL数据库中的表。

Pandas拥有强大的方法来处理这类数据的大部分需求。

作为示例，我们将查看关于澳大利亚墨尔本房价的数据。在实践练习中，你将应用相同的过程到一个新的数据集，该数据集包含了爱荷华州的房价。

墨尔本的示例数据位于文件路径../input/melbourne-housing-snapshot/melb_data.csv。

我们使用以下命令来加载和探索数据：

python

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	13580.000000	1.358000e+04	13580.000000	13580.000000	13580.000000	13580.000000	13518.000000	13580.000000	7130.000000	8205.000000	13580.000000	13580.000000	13580.000000
mean	2.937997	1.075684e+06	10.137776	3105.301915	2.914728	1.534242	1.610075	558.416127	151.967650	1964.684217	-37.809203	144.995216	7454.417378
std	0.955748	6.393107e+05	5.868725	90.676964	0.965921	0.691712	0.962634	3990.669241	541.014538	37.273762	0.079260	0.103916	4378.581772
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1196.000000	-38.182550	144.431810	249.000000
25%	2.000000	6.500000e+05	6.100000	3044.000000	2.000000	1.000000	1.000000	177.000000	93.000000	1940.000000	-37.856822	144.929600	4380.000000
50%	3.000000	9.030000e+05	9.200000	3084.000000	3.000000	1.000000	2.000000	440.000000	126.000000	1970.000000	-37.802355	145.000100	6555.000000
75%	3.000000	1.330000e+06	13.000000	3148.000000	3.000000	2.000000	2.000000	651.000000	174.000000	1999.000000	-37.756400	145.058305	10331.000000
max	10.000000	9.000000e+06	48.100000	3977.000000	20.000000	8.000000	10.000000	433014.000000	44515.000000	2018.000000	-37.408530	145.526350	21650.000000

Interpreting Data Description

结果显示了原始数据集中每个列的8个数字。第一个数字，计数，显示有多少行具有非缺失值。

缺失值可能由许多原因产生。例如，在调查一个只有一间卧室的房子时，不会收集第二间卧室的大小。我们将在稍后讨论缺失数据的主题。

第二个值是均值，也就是平均值。下面，std是标准差，它衡量数值的分散程度。

要解释最小值、25%、50%、75%和最大值，可以想象将每个列按从低到高的顺序排序。第一个（最小的）值是最小值。如果你走到列表的四分之一处，你会发现一个数字，它比25%的值大，比75%的值小。那就是25%的值（发音为”第二十五百分位”）。50%和75%的百分位定义类似，最大值是最大的数字。

Exercise:Explore Your Data

Step 1: Loading Data

Read the Iowa data file into a Pandas DataFrame called home_data.

python

import pandas as pd

# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)

# Call line below with no argument to check that you've loaded the data correctly
step_1.check()

Step 2: Review The Data

Use the command you learned to view summary statistics of the data. Then fill in variables to answer the following questions

python

# Print summary statistics in next line
home_data.describe()

Id	MSSubClass	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	…	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SalePrice
count	1460.000000	1460.000000	1201.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1452.000000	1460.000000	…	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	730.500000	56.897260	70.049958	10516.828082	6.099315	5.575342	1971.267808	1984.865753	103.685262	443.639726	…	94.244521	46.660274	21.954110	3.409589	15.060959	2.758904	43.489041	6.321918	2007.815753	180921.195890
std	421.610009	42.300571	24.284752	9981.264932	1.382997	1.112799	30.202904	20.645407	181.066207	456.098091	…	125.338794	66.256028	61.119149	29.317331	55.757415	40.177307	496.123024	2.703626	1.328095	79442.502883
min	1.000000	20.000000	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000	34900.000000
25%	365.750000	20.000000	59.000000	7553.500000	5.000000	5.000000	1954.000000	1967.000000	0.000000	0.000000	…	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	2007.000000	129975.000000
50%	730.500000	50.000000	69.000000	9478.500000	6.000000	5.000000	1973.000000	1994.000000	0.000000	383.500000	…	0.000000	25.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000	163000.000000
75%	1095.250000	70.000000	80.000000	11601.500000	7.000000	6.000000	2000.000000	2004.000000	166.000000	712.250000	…	168.000000	68.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000	214000.000000
max	1460.000000	190.000000	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	5644.000000	…	857.000000	547.000000	552.000000	508.000000	480.000000	738.000000	15500.000000	12.000000	2010.000000	755000.000000

结果显示了原始数据集中每个列的8个数字。第一个数字，计数，显示有多少行具有非缺失值。
缺失值可能由许多原因产生。例如，在调查一个只有一间卧室的房子时，不会收集第二间卧室的大小。我们将在稍后讨论缺失数据的主题。
第二个值是均值，也就是平均值。下面，std是标准差，它衡量数值的分散程度。
要解释最小值、25%、50%、75%和最大值，可以想象将每个列按从低到高的顺序排序。第一个（最小的）值是最小值。如果你走到列表的四分之一处，你会发现一个数字，它比25%的值大，比75%的值小。那就是25%的值（发音为”第二十五百分位”）。50%和75%的百分位定义类似，最大值是最大的数字。

python

# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())

# As of today, how old is the newest home (current year - the date in which it was built)
newest_home_age = 2024 - round(home_data['YearBuilt'].max())

# Checks your answers
step_2.check()

Analyze:

Question:What is the average lot size (rounded to nearest integer)?

这个问题在翻译时，可能会错误的翻译为**“平均地块大小（四舍五入到最近的整数）是多少？”**从而导致无从下手，实际上这个问题的翻译应该是 lot地区的平均面积是多少(四舍五入到最近的整数)

在前文中我们使用pandas导入了训练数据集home-data-for-ml-course/train.csv，并在下一行打印了统计信息。

python

import pandas as pd

# 数据集的读取路径
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

# 将文件读入一个名为 home_data 的变量中
home_data = pd.read_csv(iowa_file_path)

# Print summary statistics in next line
home_data.describe()

显然，我们需要读取出特定行列的值，如何实现呢？

坦白的来说，并不存在特定的行，因为我们所看到的行名，比如说count，mean，std……实际上是被计算完成后的统计量，在使用时，我们实际上只能指定列名，但幸运的是，我们也不需要找到特定的一行，而是得到一些统计上的特征量

python

# home_data 是Pandas中的DataFrame，home_data['LotArea']则指定了该DataFrame中名为LotArea的列，而 mean()方法则用于计算均值，同样的 max()方法用于计算最大值
# 所以 What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())
# As of today, how old is the newest home (current year - the date in which it was built)
# 距今（2024），YearBuilt列中最大的，就是最晚建造的,也就是最新建造的
newest_home_age = 2024 - round(home_data['YearBuilt'].max())